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Abstract 

Gene expression analysis aims at identifying the genes able to accurately predict biological parameters 
like, for example, disease subtyping or progression. While accurate prediction can be achieved by means of 
many different techniques, gene identification, due to gene correlation and the limited number of available 
samples, is a much more elusive problem. Small changes in the expression values often produce different 
gene lists, and solutions which are both sparse and stable are difficult to obtain. We propose a two-stage 
regularization method able to learn linear models characterized by a high prediction performance. By 
varying a suitable parameter these linear models allow to trade sparsity for the inclusion of correlated 
genes and to produce gene lists which are almost perfectly nested. Experimental results on synthetic and 
microarray data confirm the interesting properties of the proposed method and its potential as a starting 
point for further biological investigations. 
Matlab code is available upon request. 



1 Introduction 

The extraction of relevant biological information from gene expression data - like disease subtype, sur- 
vival time, or assessment of the gravity of an illness - requires the identification of the list of the genes 
potentially involved in the process, genes which need to be further scrutinized by cross-checking the 
available knowledge or through additional quantification methods. In a typical study, the size of the 
data set is less than a hundred, while the dimensionality of the data may be tens of thousands. As a 
consequence, feature selection, i.e., the identification of the gene signature actually involved in the process 
under study, is a formidable task for which classical statistics (designed to deal with large sets of data 
living in low-dimensional spaces) may not be well suited. 
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A great amount of supervised learni ng techniques have b e en proposed to address the p roblem of feature 
sele ction - see e .g. the recent surveys bv lDudoit et al. [20021 ] . [Guvon and Elisseeff [2003| . I ISaevs et al. [2007f 



and lVert [20071 ] . and the references therein. One usually classifies the methods into three categories: fil- 
ters, wrappers, and embedded methods. Filters implement feature selection through a preprocessing 
step disconnected from the learning phase. Examples of filters are ranking criteria where standard sta- 



tistical tests, correlation, and mu tual information are used to score each feature [Golub et al.. 19991 



Weston et al . 200"ol Forman. 2003 1. Filters are also used as a preprocessing step to reduce the huge 



dimensionality of feature space. The drawback of these approaches is that the selection of features is 
performed beforehand and independently of the specific required prediction or classification task. More- 
over, the selection is often made on a univariate basis, i.e. neglecting the possible correlations between 
the features. 

On the other hand, in wrappers, the re levance of a feature subset is determined according to prediction 
performance of a given learning machine ( [Guvon et al.. 2002LlKohavi and John. 1997liFurlanello et al.. 20031 ] 



and references therein). However, the exploration of all subsets of a high-dimensional feature space is in 
general a very demanding task from the computational point of view. 

Differently from wrappers and filters, where variable selection and training are two separate processes, 
embedded methods present the advantage of incorpor ating feature selection within the construction of the 



classifier or regression model. Besides decision tree s [Breiman et al.. 1984j and boosting methods such as 



the popular Adaboost [Freund and Schapire. 1997| , an appealing new trend has emerged recently namely 
the use of penalized methods in genomics or proteomics. These methods consist in the minimization 
of a well-defined objective function to which a penalty term is added in order to avoid "overfitting" , 
i.e. to provide some form of "regularization" - or equivalently an implicit reduction of the dimension- 
ality of the feature space. A variety of such methods have been proposed in the recent literature and 
differ by the choi c e of the objective fun ction and of the penalty term; for a recent overview, see e.g. 
[Segal et al.. 2003| . iMa and Huang. 2008| and the references therein. 



Particularly interesting are penalties which allow to enforce sparsity of the model, namely to perform 
automatic feature selection by assigning truly zero weights to all but a small number of selected features. 
The most famous example are the £ 1 -type penalties used in the so-called LASSO regression, a name 
coined bv lTibshirani [19961 ] as an acronym for "Least Absolute Shrinkage and Selection Operator" . The 
use of LASSO fo r genomics is also advocated e.g. in the recent papers bv lGhosh and Chinnaivan [20051 ] 
and ISegal [20061 ]. However, a drawback of LASSO regression in the presence of groups of correlated 
feat ures is that the metho d is not able to identify all members of the group. Under the name "elastic 
net" Zou and Hastie [2005 ] have proposed a modification of the LASSO method able to overcome such 



drawback and to identify groups of correlated genes. 

Building on this elastic-net regularization strategy, we propose a two-stage method which produces 
gene signatures able to effectively address prediction problems from high-throughput data like DNA 
microarray. In the first stage, the method learns from the available data a minimal set of genes the 
expressions of which are best suited to accurately predict the biological parameter related to the problem 
at hand. By selecting the model through the combination of two optimization schemes, elastic net and 
regularized least squares, our method leads to a model which, unlike the elastic net alone, is characterized 
by both sparsity and low bias. In the second stage, by varying a suitable parameter, the method is able 
to produce models of increasing cardinality by gradually including genes correlated with the set of genes 
identified in the first stage. 

Being formulated as a convex optimization problem, our learning method has a sound mathematical 
foundation and, moreover, as we will see, the models can be efficiently computed through simple and 
easy-to-implemcnt algorithms. The method relies on a truly multivariate analysis and, in contrast to the 
usual genc-by-gene analyses, does not only rank genes on the basis of their differential expression on the 
samples. The embedded feature selection mechanism takes into account the correlation patterns arising 
from the organization of genes in cooperating networks. 
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We show that the method gives stable results even in the presence of data set of low cardinality. The 
two main features of the proposed method are that it provides nested list of genes and that the genes addi- 
tionally included in the longer lists are correlated with the genes of the shorter lists. Both these properties 
can be very helpful when analyzi ng high-throughput d ata and might shed light on the biological mech- 
anisms under study. As shown by |Pe Mol et al. [20081 ] . instead, the obtained models are asymptotically 
equivalent in terms of prediction accuracy. The choice of which list is the most appropriate is left to the 
molecular biologist and ultimately depends on the underlying question and the available prior knowledge. 

The paper is organized as follows. In Section [2] we describe the method we propose for extracting 
nested lists of relevant genes. In Section [3] we analyze the algorithms we developed to solve the main 
underlying optimization problem and the model selection problem. Experimental results are presented 
and discussed in Section [4j 



2 Our approach 

In this section we first set the notation and review some basic concepts of learning theory which are 
relevant to this research. Then, we present our method and motivate our strategy for model selection. 



2.1 Formulation of the problem 

We assume we are given a set of n examples as input/output pairs. We denote the inputs with 
Xi G X = MP, i = 1, ...,n; in our case the components of the vector x% are the expressions of the p 
probe sets synthesized on the chip for each patient i. We note that n may be about 100 or 1000 times 
smaller than p. The outputs, or corresponding responses, are denoted with yi € y and can be either a 
discrete class label in classification problems (e.g. discriminating between disease subtypes), or a continu- 
ous real variable in regression problems (e.g. a measurement of some biological parameter, survival time, 
or assessment of the gravity of the illness). The problem we face is to find which of the p components 
are needed to predict the response y as accurately as possible from any given input x. In our case the 
model cardinality is known to be much smaller than p, though the complexity of gene regulatory networks 
makes it difficult to determine the number of genes actually involved in the process. 

We restrict our attention to linear functions, or equivalcntly to vectors (3 G MP, modelling the relation 
between x and y as y = (3 ■ x. For simplicity we assume that both x and y have zero mean. As customary 
in learning theory, the given examples pairs are assumed to be drawn i.i.d. from a fixed but unknown 
probability density p(x,y) with (x, y) G X x y. Therefore, if the risk of predicting [3 ■ x instead of y is 
measured by (/? • x — y) 2 , the expected risk of a given model (3, in the least-squares sense, is given by 

E[[3] = f {y-(3-x) 2 p{x,y)dxdy. (1) 
Jxxy 

The goal is to determine a sparse model (3* , i.e. a model of cardinality much smaller than p - that is, a 
vector (3* with only k entries different from zero (with k << p) - for which the expected risk, £[(3*], takes 
on a small value. We recall that the components of the model vector are called regression coefficients or 
weights. 



2.2 Penalized regression methods for learning 



The core of the method we propos e in this paper is the minimization of the objective function recently 
proposed by IZou and Hastie [2005j and which we write as 



(2) 



with X the n x p matrix such that the entry [X]ij is j-th component Xij of Xi and Y the n x 1 vector 
with [Y]i = yi. In order to be consistent with the notation in (fT]) we subtract to the j — th component of 
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Xi the average — y^" Xi.j and to the average — 7^. In other words, data are recentered with respect 
to their center of mass. The first term in © measures the least-squares discrepancy of the model (3 € M. p 
on the n training examples and is the empirical risk - i.e. the empirical counterpart of the expected risk 

©■ 

The second and third terms in enforce uniqueness and numerical stability of the minimizcr 
by penalizing respectively the square of the Euclidean, or £ 2 -norm, \\(3\\ 2 — Y^j=i$ji an d the t 1 - 
norm, \{3\ x = X^=i °f the model vector /3. The nonnegative parameters \i and t are the cor- 
responding requlariza tion parameters. The minimizcr /3 en (/i,T) of l[2|). called the naive elastic net by 



Zou and Hastie [2005l |. trades closeness to the data with the size of the £ 2 - and ^ 1 -norm of the solution. 
Before discussing the use of (J2J in our approach, let us first summarize the main properties of the two 
penalized schemes obtained by setting either fx = or r = in ([2]). 



Ridg e regression Hoerl and Kennard. 1970l Hastie et al.. 2001 1 - also kno wn as regularized least 



squares |Engl et al.. 19961 Bertero and Boccacci. 19981 ] or regularization networks Poggio and Girosi. 1990j 



- avoids overfitting by controlling the size of the model vector /3, measured by its £ 2 -norm. Setting r = 
in the objective function |[5J), one gets this ^ 2 -norm penalized regression. The unique minimizer is then 
a model vector with typically all entries different from zero. The linear computational schemes arising 
from this framework arc easy to implement and produce numerically stable solutions which, for optimal 
values of the regularization parameter, lead to accurate predictions. However, they tend to distribute 
the weights evenly among correlated features and, thus, they are ill-suited for performing feature selection. 



LASSO regression Tibshirani. 1996| avoids overfitting by enforcing sparsity, i.e. by favoring model 



vectors with only a small number of entries different from zero. This scheme is equivalent to ^-norm 
penalized regression and is obtained by setting /i = in the objective function j2]). In applications in 
which the solution is known to depend on a relatively small number of features, LASSO appears to be 
quite appropriate. The minimizer is known to be unique (except for very special configurations of the 
inputs) and stable with respect to noise in the output data yi. However, small changes in the components 
of the input data Xi lead to a different feature selection, typically with no appreciable change in the 
overall expected risk (or accuracy in the performance) of the obtained model. Consequently, when the 
inputs arc affected by noise or the number of examples is small compared to the number of features, the 
selection of the components of the model vector (3 might be driven by random fluctuations. 



Empirical evidence [Zou and Hastie. 20051 ] indicates that the naive elastic net produces stable solu- 
tions, exhibits an interesting grouping effect by selecting correlated features (due to the presence of the 
£ 2 -norm term), but suffers from a quite severe solution bias (due to the shrinkage phenomenon induced 
by the £ 1 -norm term). Moreover, good generalization performances are reported only for large values 
of the fi parameter, case in which the obtained solution is very similar to ridge regression. In order to 
contrast bias and enhance the ability of I -norm of promoting sparse solutions, IZou and Hastie [20051 ] 
proposed to rescale the coefficients and introduced what they called the elastic net. We rely on the naive 
elastic net for our method but, in order to overcome its limitations, we explore a different direction and 
propose an alternative remedy. 

2.3 A two-stage method 

Our method consists of two stages. In stage I we obtain a model with minimal cardinality and small bias 
by selecting the model through the coupling of two optimization procedures. In the first optimization 
procedure we perform gene selection by minimizing ([2]) for a small value of the € 2 -norm parameter ji. 
The second optimization is a regularized least squares and consists in minimizing 
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Y-X/3 



+ A 

2 
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(3) 



in which (3 and X represent respectively the weights vector (3 and the input matrix X restricted to the 
genes selected by the first procedure. The cross-validation protocol which we employ for model selection 
yields relatively large values of r and very small values of A. Consequently, for the optimal parameter 
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pair, the solution obtained from the first optimization selects a small number of genes (typically charac- 
terized by a severe bias) while the regularized least-squares minimization ([3]) restricted to the selected 
genes returns a model capable of more accurate predictions. 



In Stage II we aim at gradually increasing the model cardinality by including genes correlated to the 
minimal set identified in stage I. This result is achieved by running the two optimization procedures of 
the previous stage for increasing values of /Lt, while keeping A and r fixed to the optimal values obtained 
through the cross-validation protocol mentioned above. The resulting one-parameter family of solutions, 
^ 0}i yields lists of relevant genes of increasing cardinality. The experimental results reported in 
Subsection 13.21 show that the obtained lists, for increasing values of /x, are almost perfectly nested. 



Our findings are in line with a recent work De Mol et al.. 2008} in which the minimization of ([2]) is 



shown to yield consistent estimators of the linear model for n —* oo. This means that we can find suitable 
sequences of parameters r„ and \i n = er n tending to zero as n — > oo, such that we have, in probability, 

£ [/3(/x n , r n )] —y inf E for n — > oo. 

Notice that consistency, which is obtained for any value of the parameter e controlling the degree of 
correlation, implies Bayes co nsistency, namely tha t the misclassification error of /3(/i„ , r„) , R[(3] , converges 



to the Bayes risk R*, since (jBartlett et al. [2006| l 



(R[P] - R*) < y/£\p] - ini£. 



2.4 The need for a second optimization procedure 

Leng et al. [20061 ] have shown that if the prediction accuracy is used as a criterion to choose the tuning 
parameter, € 1 -norm penalized regression methods (like LASSO) provide consistent estimates in terms of 
prediction accuracy but not necessarily in terms of variable selection. With the following toy example we 
provide empirical evidence of the fact that very good prediction accuracy and correct variable selection 
can be both achieved by coupling the ^-norm penalized regression method with a second optimization 
step restricted to the selected variables. 

We consider a linear regression model y = x ■ [3* + e in which the samples x and the model /3* belong 
to R 1000 , y £ R, and e represents the noise. The training and the validation sets are built by randomly 
drawing 50 and 1000 samples, respectively, from a uniform distribution between [—1, 1] for each of the 
1000 components. The true model j3* has only the first three components different from while the noise 
is sampled from a zero-mean Gaussian distribution with standard deviation a = 0.5. 

For simplicity, we set fi = X = and compare the results obtained with 

• (a) ^-norm penalized regression (LASSO) alone, and 

• (b) £ 1 -norm penalized regression followed by ordinary least squares on the selected features. 

The error curves in Figure [1] show how procedure (b) allows to reach a lower minimum than (a). The 
validation error is taken to be the mean-square error. The minimum is clearly reached for a larger value 
of r, i.e. when the £ -norm penalized regression algorithm selects a lower number of features. 

Let us now compare the estimators (3 a and (3 b obtained, respectively, through (a) with r = r a and 
through (b) with t = r b (these parameters minimize the corresponding validation error). In Table [1] we 
report the first three components of (3*,(3 a ,(3 b . The last column /3 C , represents the output of £ 1 -norm 
penalized regression with r = r 6 . From Tabic [1] we can see that both /3° and f} b approximate well 
the relevant components of (3*. However, while j3 b correctly selects the model, f3 a has many non-zero 
components besides the first three. This is due to the £ -norm penalized regression which, in order to 
reduce bias, induces an optimal choice of r smaller than the one needed to correctly identify the model. 
This can also be seen from the fact that, whereas the ^-norm penalized regression followed by ordinary 
least squares selects the correct features (the model (3 b has only the first three components different from 
zero) and returns almost perfect feature weights, the estimator (3 C - obtained for r = r b - underestimates 
all three coefficients. We can thus conclude that the model selection obtained by coupling the two 
optimization procedures appears to be more effective. 
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Figure 1: Validation error for (a) € 1 -norm penalized regression, and (b) ^ 1 -norm penalized regression 
followed by ordinary least squares for different values of the penalized regression parameter r. 

3 Algorithmic aspects 

In this section we discuss several algorithmic aspects of our work. We first present an iterative algorithm 
for estimating the elastic net solution, and then describe the details of our procedure for model selection 
which, even for data sets of small size, requires a very large number of optimization cycles. Finally, we 
provide empirical evidence of the correctness of a procedure capable of obtaining almost exactly the same 
lists of genes with a large reduction of computing time. 

3.1 Damped iterative thresholding 

In order to minimize l[2]). w e use an algorithm which generalizes the following Landweber or gradient- 
descent iterative procedure Engl et al.. 19961 ]. known to converge to a minimizer of the unpenalized 



least-squares objective function A(/3) = — \\Y — -V/3|| 2 : 

f3V+V=(3V) + hx T Y-X T Xp% 1 = 0,1,.... (4) 

where X T denotes the transpose of X and the constant 2C is a strict upper bound for the spectral norm 
of the matrix X T X : \\X T X\\ < 2C. 



Inspired by the iterative thresholding algorithm proposed bv lDaubechies et al. [2004 for pure ^ 1 -norm 



penalized regression, we propose a double modification of Landweber algorithm which provably converges 
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RELEVANT COMPONENTS OF A LINEAR MODEL IN K 1000 IN A TOY PROBLEM ESTIMATED 

BY DIFFERENT REGRESSION TECHNIQUES. 



0* 


(5 a 


P b 


(3 C 


0.6449 


0.5667 


0.6705 


0.3912 


0.8180 


0.7389 


0.8106 


0.6388 


0.6602 


0.5785 


0.6794 


0.4210 



Tabic 1: From left to right: the true weights of the only three non-zero components of the model (/?*), 
the corresponding weights obtained with ^-norm penalized regression with r = r a (/3 a ), with ^-norm 
penalized regression followed by ordinary least squares with t = r b ((3 b ), and with ^-norm penalized 
regression with r = r b (f3 c ). 



to the minimizer of i[2]). The first modification amounts to applying a soft-thresholding operator S nT /c 
at each iteration. The soft-thresholding operator S Q acts on a vector component-wise as follows 

PaWll - \ o if \p.\ <a /2. W 

This operation enforces the sparsity of the regression coefficients in the sense that all coefficients below 
the threshold a/2 are set to zero. The second modification is a simple multiplication which leads to the 
following damped iterative thresholding scheme: 

^ +1) = ^S^(/3S+ ^[X T Y-X T X^]). (6) 
We recover the cases of ridge regression and damped Landweber iteration for r = 0, wh ereas pure l 1 - 



regularization and the iterative thresholding scheme considered in lDaubechies et al. [2004| correspond to 
the special case fi = 0. For r = /.t = 0, we get the original Landweber iteration (TJJ. The convergence of 
((6|) - for ji > and any initial vector /3™ - to the minimi zer of (|2l) is a straight forward consequence of Ba- 



nach's fixed point theorem for contractive mappings (see lDe Mol et al. [2008 1 for an extensive discussion 
of the properties of this algorithm in a broader setting). 

Since the proposed scheme is iterative we need to define a stopping rule. We first tried to use a fixed 
tolerance 5, letting the iterations stop if — Pi^U < ^l/^h, for all k. However, after extensive 

experimentation on toy examples and real data, we empirically observed that a tolerance depending on 
the number of iterations is more efficient and equally easy to implement. As shown in Figure^ if the 
algorithm stops when the relative change of each coefficient /?[ is smaller than a tolerance S = 0.1/1, 
with I the number of performed iterations, the support of the selected features appears to be stabilized. 



3.2 Model selection procedure 

In all of the performed experiments the training sets were recentered as described in Subsection 12.21 and 
the test sample (either of the validation or of the test set) was recentered with respect to the center of 
mass obtained from the training set. 

The data set is initially divided in training and test set. The training set is further partitioned in 
k subsamples X\, . . . ,X^ with k depending on the cardinality of the training set. In Stage I, for each 
subsample JQ, a classifier is first built using as training set the remaining k — 1 subsamples with r and 
A ranging on a grid in the parameters space, and then validated on X^. Each classifier is built by mini- 
mizing the objective functions ([2]) and ([3]) with the current values of t and A and a fixed small value for 
[i (typically /i = 10~ 6 ). For each parameter pair the validation error is estimated as the average error 
over the k subsamples. Finally the optimal parameter pair, (r op t, X op t), is selected as the minimizer of 
the validation error. 
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Figure 2: Coefficient path of the relevant genes obtained from a real data set vs. the number of iterations 
of the scheme (j6]) on a logarithmic scale (the vertical dashed line corresponds to the stopping rule). 



In Stage II a family of classifiers is built on the entire training set with t = T opt , A = X op t and for m 
increasing values of fi. Along with a test error each classifier returns a list of variables indexed by the 
value of fx. A pseudo-code version of this procedure is summarized in the box below. 



Given 

-(X,Y) training set, and (X test , Y test ) test set 
-{(X u Yx), . . . , (X k , Y k )} partition of (X, Y) 
-/x < /ii < • • • < /x m _i 
Stage I 

-let /i = no, (r t , Xi)teT.iec a grid in parameter space 
-for t e T and Z g £ 
for i = 1 to fc let 

X* r := X\, Xfc 
Y/' := Yi, ...Y-ii Y+ii ... , Yk 

fl(t, I, i) := classifier built on (Xf r , Y/ r ) for r = r t , = /z , and A = A; 
Err(t, I, i) := error made by /3(i, I, i) on (X^, Y) 
end 

Efr&l) ■= \Y, k i=i E rr{t,l,i) 
end 

Stage II 

-find {T opt ,X opf ) minimizing Err{t,l) 
-for i = to m — 1 let 

/?* := classifier built on (X, Y) for r = T opt , /i = and A = A op t 

Err\ est := error made by /?*. on (X* est , Y test ) 
end 



For small values of fj,, the solution of the damped iterative thresholding scheme (J6j) - first step for the 
construction of each classifier /3{t, I, i) in Stage I - requires a very large number of iterations. Consequently, 
the procedure for determining the optimal values of r and A in Stage I, procedure which must be repeated 
k x \T\ x \C\ times with [i = 10~ 6 , is quite time-consuming. In order to speed up the entire process, we 
explored a different approach in which, for each value of r and A, a series of damped iterative thresholding 
schemes are run for 10 decreasing values of /z (fix = 10~ 3 , fXio = 10~ 6 ), the i-th scheme being restricted 
to the variables selected in the (i — l)-th scheme with i = 2, 10. Extensive experiments on synthetic 
and real data indicate that the features selected through this alternative approach are almost always the 
same as those obtained with the procedure described in the original scheme, but with about a 100-fold 
reduction in computing time. For example Table [2] shows the results we obtained on three data sets 
of patient microarrays which we analyze in Subsection 14.31 As it can easily be verified by inspection 
of Table [2] the group of genes selected with the parameter pair (r op j,/x) is almost perfectly enclosed in 
the group selected with the parameter pair (r op t,/i) for p, > /i. Therefore, we decided to implement our 
procedure with the optimization described above. Notice that, by doing so, the nesting of the obtained 
lists is always perfect, since the optimization for fi — fii is restricted to the variables selected for fi = 



4 Results and discussion 

In this Section we report and discuss the results we obtained by running our method on both synthetic 
and real data. Real-data experiments encompass the analysis of both highly purified cell lines grown in 
laboratories and samples from patients' tissues. 



4.1 A toy problem 

We first applied our method on a toy example generated according to scenario (d) in Zou and Hastie [2005j , 



where the relevant features are known in advance. The problem is close to real gene expression data con- 
ditions in that it encompasses both dependence on more than one variable and intra- variables correlation, 
though in a setting of much lower dimensionality. 

We consider a set of n = 100 toy-patients. To each patient i we associate a 40-dimcnsional vector 
Xi built as follows. We divide the 40 components in four groups. Group G\ consists of the first five 
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NESTING PROPERTY OF THE PROPOSED METHOD. 



n 




Leukemia 
A 


B 


/" 




Lung C. 
A 


B 


M 


Prostate C 
A 


B 









30 


100% 









37 


97% 







24 


100% 


6 • 


ID" 


-0 


35 


100% 


2 • 


io- 


-7 


3G 


83% 


10 


-6 


2G 


100% 


10 


— 5 




38 


100% 


3 • 


10" 


-6 


54 


94% 


5 • 


10~ 6 


3G 


100% 


3 • 


10 


-5 


50 


100% 


9 • 


10" 


-6 


79 


99% 


3 • 


10~ 5 


61 


98% 


5 • 


10 


-5 


57 


100% 


10 


-5 




98 


98% 


4- 


10~ 5 


73 


100% 


10 


-4 




77 


100% 


3 • 


io- 


-5 


152 


100% 


6 • 


10~ 5 


89 


99% 


2 • 


10 


-4 


119 


100% 


5 • 


10" 


-5 


182 


100% 


10 


-4 


123 


98% 


4- 


io- 


-4 


144 




7- 


io- 


-5 


218 




4- 


io- 4 


224 





Table 2: The leftmost column contains the values of the parameter [i, while for each disease column A 
contains the number of selected genes and B the percentage of genes present in the gene set selected with 
the next larger value of [i and with r fixed at its optimal value. 



components x^i, ar^g which are obtained by randomly drawing a number Z\ from a zero-mean Gaussian 
distribution with a = 1 and adding to it a noise term tj, j = 1, 5, randomly drawn from a zero-mean 
Gaussian distribution with a = 0.01. For each Xi.j wc thus have 

Xi,j = Z\ + ej, for j = 1, 5. 

The second and third five components, belonging to the groups G2 and G3 respectively, arc built similarly 
and read 

= z 2 + Cj, for j = 6, 10 

Xi,j = Z 3 + ej, for j = 11, 15. 

The fourth group G4 consists of the remaining 25 components randomly drawn from a zero-mean Gaussian 
distribution with a = 1. By construction each of the groups G\ 1 G2, and G3 consists of five equivalent 
variables and the true model (of all possible models the one with largest cardinality) is written as 

= (i 1 i^,o 1 (W)). 

15 25 

For small values of \i the method is thus expected to select one variable from each of the G\ , G2 , and G3 
groups, while for increasing values of /1 it should also include the remaining 12 variables. All the variables 
of the group G4, instead, should be discarded independently of \x. 



We first run stage I of the method by setting fj, = (i.e. perform a LASSO regression) and repeat the 
experiment over 50 data sets. Each data set was split in training and validation set and the parameters 
r* and A* were chosen as the ones minimizing the error on the validation set. The method selects a 
correct model (one variable from each of the three groups Gi, G2, and G3) about 60% of the times and 
a slightly redundant set (one extra variable from any of the three groups) about 20% of the times. 

We then run stage II with [i = 1000 • r*. The frequency histogram of the number of selected features 
is shown in Figure [21 left. As expected, the histogram is peaked in correspondence of 15, the maximum 
number of relevant variables. The fact that the ratio between the number of the relevant features selected 
by the model and the number of features selected by the model is peaked around 1 (see Figure [3l right) 
confirms that most of the times the selected features belong to the correct model. 



4.2 Cell-culture microarray data 

W e now analyze the RAS data set used in Bild et al. [2006| and available on line at |http : //data . genome . duke . edu/oncog~ 
In lBild et al. [2006j human primary mammary epithelial cell cultures (HMECs) were used to develop a 
series of pathway signatures related to different oncogenes. In short, cells were infected with different 
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Figure 3: Number of variables and accuracy of the model selected in a toy problem for /i = 1000 -r*; (left) 
frequency histogram of the number of variables of the obtained model over 50 trials; (right) frequency 
histogram of the ratio between the number of variables belonging to either G\, G2 or G3, and the overall 
number of selected variables. 



adenoviruses for eighteen hours and signatures were extracted as the set of genes most correlated to the 
classification of HMEC samples into oncogene-activated versus control. In order to test our method for 
gene selection, we applied our protocol on a subset of 20 HMEC samples, comprising 10 controls and 
10 samples infected with adenovirus expressing activated H-Ras, thus extracting an alternative pathway 
signature for RAS oncogene. The classification task concerned with this data set is trivial since most 
classification algorithms can easily discriminate between the two classes. In this case, however, we are 
not interested in the classification performance on an independent test set, but in the selection of relevant 
gene lists and in their hierarchical structure. We thus apply our method to the RAS data and report the 
heat maps of the selected gene lists in Figure SJ For fi — 0, the method extracts a minimal set consisting 
of two probe sets of RAP1A (Figure [5J left), a gene belonging to the RAS oncogene family, whereas for 
increasing values of fx, the method selects perfectly nested larger sets of genes (probe sets) correlated or 
anti-correlatcd with the first two, but with lower fold change. In Figured! middle and right, we show the 
results obtained for [i = 0.05 and fi = 0.5 respectively (corresponding to 15 and 144 genes). 

From the obtained results we can see that the method selects nested groups of genes which appear 
to be relevant to the RAS status, nicely sorted b y their differentia l expressions. The two minimal probe 



sets are not part of the RAS signature defined in [B ild et a l.. 200a , whereas in the larger gene lists about 
80% of the genes overlap with those found in [Bild et al.. 2006| (12 out of 15 and 112 out of 144). 



4.3 Patient-tissue microarray data 

Finally, we carried out experiments on data sets relative to three diseases: leukemia, lung cancer and 
prostate cancer. These gene expre ssion data sets are available on line and concern classification problems. 



The first one is the Golub data set Golub et al.. 19991 ] (http : //www .broad .mit . edu/cgi-bin/ cancer/datasets . 



which comprises expressions of 7129 probe sets for 72 patients (samples) divided in two classes according to 
the diagnosi s of Acute Myeloid L eukemia (AML) or Acute Lymphoblastic Leukemia (ALL). The lung can- 
cer data set Gordon et al.. 20*02] ] (|http : //www . chestsurg . orgj) consists of 181 samples with 12533 probe 



sets and each patients is either affecte d by malignant pl eural mesothelioma (MPM) or adenocarcinoma 
(ADCA). Finally, the prostate data set [Singh et al.. 2002j | ( |http : //www- genome . wi mit . edu/mpr/prostatep 
consists again of 12533 probe sets for 102 samples, tumor or normal tissue. In all cases, the vector Y is 
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Figure 4: Heat maps from cell cultures data. Image intensity display of the expression levels of the 2 
(left), 15 (middle) and 144 (right) genes selected by our method for // = 0, 0.05, and 0.5 respectively. 
Expression levels are standardized to zero mean and unit variance across samples, displayed with genes 
as rows and samples as columns, and colour coded to indicate high (red) or low (blue) expression levels. 



formed by labels +1 or —1 distinguishing the two classes. 

We carried out our experiments through leave-one-out (LOO) cross-validation on the Leukemia data 
(given the small number of available samples), and 10-fold cross-validation on both the Lung and Prostate 
Cancer data. A first indicator of the effectiveness of the proposed method is the stability of the various 
gene lists obtained in the training phase. In Figure [5j [6l and [7] we report the number of selected genes 
versus the selection frequency for different values of the parameter fi. By inspecting Figures [5l [51 and 
[7J one sees that the produced gene lists are remarkably stable. For increasing values of \i the number of 
genes appearing in all of the lists ranges from about 1/3 to about 1/2 of the average number of genes, 
while the number of genes appearing in at least 50% of the lists is very close to the average. 

The accuracy of the method on the three data sets (which should remain the same for the different 
values of /i) is illustrated in Table [31 By inspection we can see that for each disease and different values 
of /i (column A) the model cardinality from top to bottom increases (column B) while the prediction ac- 
curacy on the test set (column B) remains quite stable. For each disease in column B errors are reported 
for the two classes separately. The rightmost column C gives the percentage of samples which have to be 
rejected for both classes in order to reach 100% classification rate. 

The rejection region corresponding to fi = for the three diseases is depicted in Figure [HI The solid 
line gives the decision boundary, while the dashed lines mark the rejection region needed to reach the 
perfect score. No rejection region is needed for the Leukemia study (Figure [5J left), a one-sided rejection 
region for the lung cancer study (Figure[51 middle) and a wider two-sided rejection region for the prostate 
cancer case (Figure right). 

An improvement in prediction accuracy is not the aim of the proposed method. However, it is inter- 
esting to notice that the proposed method reaches performances which are at least as good as and often 
better than those reported in the original studies. In the leukemia original paper [Golub et al. 19991 ] . 
a 50-gcnes classifier is built which scored 100% on the test set, though only 29 of the 34 test samples 
corresponded to strong prediction (i.e. prediction with a high confidence level). Th e prediction accuracy 
of our method ranges from 91% to 100%. As for the lung cancer data analysis in lGordon et al. [2002| . 
different classifiers were reported with prediction accuracy ranging from 91% to 99%, t o be compared with 
the 99% achieved with our algorithm. In the end, for the prostate cancer data set, in lSingh et al. [2002j | 
- after gene ranking with variation of a signal-to-noise metric - a fc-NN algorithm obtained a prediction 
accuracy ranging from 82.9% to 95.7% depending on the number of genes used (4 or 6); with our method 
the accuracy ranges from 92% to 96%. 
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Figure 5: Cumulative number of selected genes versus selection frequency in LOO cross-validation for 
Leukemia data. 
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Figure 6: Cumulative number of selected genes versus selection frequency in 10-fold cross-validation for 
Lung Cancer data. 
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Figure 7: Cumulative number of selected genes versus selection frequency in 10-fold cross-validation for 
Prostate Cancer data. 
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Figure 8: Rejection region for /i = 0. In Leukemia (left) the score is perfect and the region is degenerate, 
in Lung Cancer (middle) is one-sided and delimited by the dashed line, in Prostate Cancer (right) is two 
sided. 



14 



PREDICTION ACCURACY OF THE PROPOSED METHOD ON MICROARRAY DATA SETS. 
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Tabic 3: For each of the three diseases the first column contains the values of the parameter pi, the column 
A the number of selected genes, the column B the number of misclassificd samples for the two original 
classes respectively, and the column C the percentage of samples to be rejected in each predicted class in 
order to obtain 100% classification rate. The two classes are (ALL, AML) for Leukemia, (MPM,ADCA) 
for Lung Cancer, and (normal, tumor) for Prostate Cancer. 



Where available (leukemia and lung cancer), we have compared the gene lists we obtained with the 
lists produced by other methods. The results show partial superposition (depending on pi) as well as 
important differences. The difference between our results and the ones reported in the original papers 
is not surprising given the multivariate flavor of our selection procedure. Ultimately, only biological 
validation can assess the actual relevance of the gene lists obtained by different methods. 

5 Conclusion 

In this paper we have proposed and analyzed a two-stage method able to select nested groups of rele- 
vant genes from microarray data. The first stage establishes a minimal subset of genes relevant to the 
classification or regression task under investigation. The second stage produces a one-parameter family 
of groups of genes, showing a remarkable nesting property and similar performance in terms of classifica- 
tion/prediction tasks. In several problems the ability of returning nested list of relevant genes is a key to 
establish the biological significance of the obtained results and is often regarded as the most precious in- 
formation for further investigation based on biological knowledge and subsequent experimental validation. 

In both stages the method consists of an initial step in which a certain amount of genes is selected 
through the minimization of the objective function @ by means of a convergent damped iterative thresh- 
olding algorithm and of a second step in which the weights of the selected genes are refined through ridge 
regression. 

In the first stage, the i 1 parameter t and the regularization parameter A of the subsequent ridge 
regression are estimated from the data by cross-validation, while the £ 2 parameter pi is set to a small 
value. This leads to a solution characterized by a minimal subset of genes. In the second stage, the i 1 
and the ridge parameter are kept fixed to their estimated optimal value and a one-parameter family of 
solutions is generated for increasing values of the £ 2 parameter pi. In the proposed scheme the role of this 
£ 2 parameter can be thought of as a way of controlling the trade-off between sparsity and correlation in 
the solution vector. 

The results which wc obtained on several data sets, including cell-culture and patient-tissue microarray 
data confirm the potential of our approach. 
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